Web-scale Content Reuse Detection (extended) USC/ISI Technical Report ISI-TR-692, June 2014
Authors
Abstract
With the vast amount of accessible online content, it is not surprising that unscrupulous entities “borrow” from the web to provide filler for advertisements, link farms, and spam, and to make a quick profit. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms: one to automatically discover previously unknown duplicate content in the web, and a second to detect copies of discovered or manually identified content in the web. Our detection can also identify bad neighborhoods, clusters of pages where copied content is frequent. We verify our approach with controlled experiments on two large datasets: a Common Crawl subset of the web, and a copy of Geocities, an older set of user-provided web content. We then demonstrate that we can discover otherwise unknown examples of duplication for spam, and detect both discovered and expert-identified content in these large datasets. Using an original copy of Wikipedia as identified content, we find 40 sites that reuse this content, 86% for commercial benefit.
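The hashing idea in the abstract can be illustrated with a minimal sketch. This is not the report's implementation: the paragraph-level chunking, the choice of SHA-1, the function names (`chunk_hashes`, `discover`, `detect`), the thresholds, and the toy corpus are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def chunk_hashes(text):
    """Hash each non-empty paragraph after normalizing case and whitespace.

    Splitting on blank lines is an assumption; the report may chunk differently.
    """
    hashes = set()
    for para in text.split("\n\n"):
        norm = " ".join(para.split()).lower()
        if norm:
            hashes.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return hashes


def discover(corpus, min_pages=2):
    """Discovery: return hashes of chunks seen on at least min_pages pages."""
    seen = defaultdict(set)
    for url, text in corpus.items():
        for h in chunk_hashes(text):
            seen[h].add(url)
    return {h for h, urls in seen.items() if len(urls) >= min_pages}


def detect(corpus, known, min_fraction=0.5):
    """Detection: return URLs where enough chunks match the known hashes."""
    flagged = set()
    for url, text in corpus.items():
        hs = chunk_hashes(text)
        if hs and len(hs & known) / len(hs) >= min_fraction:
            flagged.add(url)
    return flagged


# Hypothetical toy corpus: URL -> page text.
corpus = {
    "a.example/1": "Original article text.\n\nSecond paragraph.",
    "b.example/x": "original   article text.\n\nBuy cheap widgets!",
    "c.example/y": "Unrelated page.",
}
known = discover(corpus)
flagged = detect(corpus, known)
```

In this toy run, the shared paragraph is discovered once and both pages that contain it are flagged, including the original; telling the source apart from the copies would need extra information that this sketch does not model. Grouping flagged URLs by site or path prefix would be one way to approximate the "bad neighborhoods" the abstract mentions.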
Similar resources
Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended)
Internet reliability has many potential weaknesses: fiber rights-of-way at the physical layer, exchange-point congestion from DDoS at the network layer, settlement disputes between organizations at the financial layer, and government intervention at the political layer. This paper shows that we can discover common points-of-failure at any of these layers by observing correlated failures. We use en...
Census and Survey of the Visible Internet (extended) USC/ISI Technical Report ISI-TR-2008-649b released
Prior measurement studies of the Internet have explored traffic and topology, but have largely ignored edge hosts. While the number of Internet hosts is very large, and many are hidden behind firewalls or in private address space, there is much to be learned from examining the population of visible hosts, those with public unicast addresses that respond to messages. In this paper we introduce t...
Census and Survey of the Visible Internet (extended) USC/ISI Technical Report ISI-TR-2008-649
Prior measurement studies of the Internet have explored traffic and topology, but have largely ignored edge hosts. While the number of Internet hosts is very large, and many are hidden behind firewalls or in private address space, there is much to be learned from examining the population of visible hosts, those with public unicast addresses that respond to messages. In this paper we introduce t...
A. Shah et al., Photovoltaic Technology: The Case for Thin-Film Solar Cells
A. Shah et al., Photovoltaic Technology: The Case for Thin-Film Solar Cells, Science 285, 692 (1999). www.sciencemag.org (this information is current as of December 16, 2006): The following resources related to this article are available online at http://www.sciencemag.org/cgi/content/full/285/5428/692 version of this article at: including high-resolution figures, can be found in the online Up...
File: draft-ietf-rsvp-md5-07.txt Bob Lindell USC/ISI Mohit Talwar USC/ISI
Cisco File: draft-ietf-rsvp-md5-07.txt Bob Lindell USC/ISI Mohit Talwar USC/ISI RSVP Cryptographic Authentication Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft ...